Abstract. This paper shows how universal learning can be achieved 
with expert advice. To this aim, we specify an experts algorithm with 
the following characteristics: (a) it uses only feedback from the actions 
actually chosen (bandit setup), (b) it can be applied with countably infi- 
nite expert classes, and (c) it copes with losses that may grow in time ap- 
propriately slowly. We prove loss bounds against an adaptive adversary. 
From this, we obtain a master algorithm for "reactive" experts problems, 
which means that the master's actions may influence the behavior of the 
adversary. Our algorithm can significantly outperform standard experts 
algorithms on such problems. Finally, we combine it with a universal ex- 
pert class. The resulting universal learner performs - in a certain sense - 
almost as well as any computable strategy, for any online decision prob- 
lem. We also specify the (worst-case) convergence speed, which is very 
slow. 

Keywords. Prediction with expert advice, responsive environments, 
partial observation game, bandits, universal learning, asymptotic opti- 
mality. 



1 Introduction 

Expert advice has become a well-established paradigm of machine learning in the 
last decade, in particular for prediction. It is very appealing from a theoretical 
point of view, as performance guarantees usually hold in the worst case, without 
any (statistical) assumption on the data. Such assumptions are generally required 
for other statistical learning methods, often however not resulting in stronger 
guarantees. 

Using expert advice in the standard way seems a rather bad idea in some 
cases where the decisions of the learner or master algorithm influence the be- 
havior of the environment or adversary. One example is the repeated prisoner's 
dilemma when the opponent plays "tit for tat" (see Section 4). This was noted 
and resolved by [1], who introduced a "strategic expert algorithm" for so-called 



reactive environments. Their algorithm works with a finite class of experts and 
attains asymptotically optimal behavior. No convergence speed is asserted, and 
the analysis is quite different from that of standard experts algorithms. 

In this paper, we show how the more general task with a countably infinite 
expert class can be accomplished, building on standard experts algorithms, and 
simultaneously also bounding the convergence rate ($t^{-1/10}$, which can be actually 
improved to $t^{-1/3+\varepsilon}$). To this aim, we will combine techniques from [2,3,4,5] and 
obtain a master algorithm which performs well on loss functions that may in- 
crease in time. Then this is applied to (possibly) reactive problems by yielding 
the control to the selected expert for an increasing period of time steps. Using a 
universal expert class defined by the countable set of all programs on some fixed 
universal Turing machine, we obtain an algorithm which is in a sense asymp- 
totically optimal with respect to any computable strategy. An easy additional 
construction guarantees that our algorithm is computable, in contrast to other 
universal approaches which are non-computable jSj. To our knowledge, we also 
propose the first algorithm for non-stochastic bandit problems with countably 
many arms. 

The paper is structured as follows. Section 2 introduces the problem setup, 
the notation, and the algorithm. In Section 3 we give the (worst-case) analysis 
of the master algorithm. The implications for active experts problems and a 
universal master algorithm are given in Section 4. We discuss our results in 
Section 5. 

2 The master algorithm 

Setup. We are acting in an online decision problem. "We" is here an abbre- 
viation for the master algorithm which is to be designed. An "online decision 
problem" is to be understood in a very general sense, it is just a sequence of 
decisions each of which results in some loss. This could be e.g. a prediction task, 
a repeated game, etc. In each round, that is at each time step t, we have access to 
the recommendations of countably infinitely many "experts" or strategies. (For 
simplicity, we restrict our notation to a countably infinite expert class, all results 
also hold for finite classes.) We do not specify what exactly a "recommendation" 
is - we just follow the advice of one expert. Before we reveal our move, the 
adversary has to assign losses $\ell_t^i \ge 0$ to all experts $i$. There is an upper bound 
$B_t \ge \|\ell_t\|_\infty$ on the maximum loss the adversary may use. This quantity may 
depend on t and is not controlled by the adversary. After the move, only the loss 
of the selected expert i is revealed. Our goal is to perform nearly as well as the 
best available strategy (expert) in terms of cumulative loss, after any number 
T of time steps which is not known in advance. The difference between our loss 
and the loss of some expert is also termed regret. We consider the general case of 
an adaptive adversary, which may assign losses depending on our past decisions. 

If there are only finitely many experts or strategies, then it is common to give 
no prior preferences to any of them. Formally, this is realized by defining uniform 
prior weights $w^i = 1/n$ for each of the $n$ experts $i$. This is not possible for countably infinite 



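    For $t = 1, 2, 3, \ldots$

        set $\hat\ell_t^i = \hat B_t$ for $i \notin \{t \ge \tau\}$ (see (2))

        sample $r_t \in \{0, 1\}$ independently s.t. $P[r_t = 1] = \gamma_t$

        If $r_t = 0$ Then

            invoke FPL($t$) and play its decision

            set $\hat\ell_t^i = 0$ for $i \in \{t \ge \tau\}$
        Else

            sample $I_t^{FoE} \sim u_t$ "uniformly", see (1), and play $I := I_t^{FoE}$
            set $\hat\ell_t^I = \ell_t^I/(u_t^I \gamma_t)$ and $\hat\ell_t^i = 0$ for $i \in \{t \ge \tau\} \setminus \{I\}$



Fig. 1. The algorithm FoE. The exploration rate $\gamma_t$ will be specified in Corollary 8.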



    Sample $q_t^i \sim$ Exp (i.e. $P(q_t^i \ge x) = e^{-x}$ for $x \ge 0$) indep. $\forall i \in \{t \ge \tau\}$
    select and play $I_t^{FPL} = \arg\min_{i: t \ge \tau^i} \{\eta_t \hat\ell_{<t}^i + k^i - q_t^i\}$



Fig. 2. The algorithm FPL. The learning rate $\eta_t$ will be specified in Corollary 8.



expert classes, as there is no uniform distribution on the natural numbers. In this 
case, we need some non-uniform prior $(w^i)_{i \in \mathbb{N}}$ and require $w^i > 0$ for all experts 
$i$ and $\sum_i w^i \le 1$. We also define the complexity of expert $i$ as $k^i = -\ln w^i$. This 

quantity is important since in the full observation game (i.e. after our decision 
we get to know the losses of all experts), the regret can usually be bounded by 
some function of the best expert's complexity. 
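
To make the prior and the complexity concrete, the following minimal Python sketch uses the weights $w^i = 1/(i(i+1))$. These weights are an illustrative assumption of ours and are not prescribed by the text; any positive weights with total mass at most one would serve equally well. The function names are hypothetical.

```python
import math


def prior_weight(i: int) -> float:
    """Illustrative prior w^i = 1 / (i * (i + 1)) for experts i = 1, 2, 3, ..."""
    return 1.0 / (i * (i + 1))


def complexity(i: int) -> float:
    """Complexity k^i = -ln w^i of expert i under the prior above."""
    return -math.log(prior_weight(i))


# The partial sums telescope to 1 - 1/(n + 1), so the total mass stays below 1.
partial_mass = sum(prior_weight(i) for i in range(1, 1001))
```

With this choice the complexity grows only logarithmically in the index, $k^i = \ln(i(i+1))$, whereas a regret bound stated in terms of $1/w^i$ grows polynomially in $i$.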

Our algorithm "Follow or Explore" (FoE, specified in Fig. 1) builds on McMahan 
and Blum's online geometric optimization algorithm [4]. It is a bandit version 
of a "Follow the Perturbed Leader" experts algorithm. This approach to online 
prediction and playing repeated games has been pioneered by Hannan [2]. For 
the full observation game and uniform prior, [3] gave a very elegant analysis 
which is clearly different from the standard analysis of exponential weighting 
schemes. It has one advantage over other aggregating algorithms such as expo- 
nential weighting schemes: the analysis is not complicated if the learning rate is 
dynamic rather than fixed in advance. A dynamic learning rate is necessary if 
there is no target time T known in advance. For non-uniform prior, an analysis 
was given in [5]. The following issues are important for FoE's design. 

Exploration. Since we are playing the bandit game (as opposed to the full 
information game), we need to explore sufficiently [7,4]. At each time step $t$, 
we decide randomly according to some exploration rate $\gamma_t \in (0, 1)$ whether to 
explore or not. If so, we would like to choose an expert according to the prior 
distribution. There is a caveat: In order to make the analysis go through, we 
have to assure that we are working with unbiased estimates of the losses. This is 
achieved by dividing the observed loss by the probability of choosing the expert. 
But this quantity could become arbitrarily large if we admit arbitrarily small 



weights. We address this problem by finitizing the expert pool at each time $t$. 
For each expert $i$, we define an entering time $\tau^i$, that is, expert $i$ is active only 
for $t \ge \tau^i$. We denote the set of active experts at time $t$ by $\{t \ge \tau\} = \{i : t \ge \tau^i\}$. 
For exploration, the prior is then replaced by the finitized prior distribution $u_t$, 

    $u_t^i = P(u_t = i) = \frac{w^i \mathbb{1}_{t \ge \tau^i}}{\sum_j w^j \mathbb{1}_{t \ge \tau^j}}$.    (1)

Consequently, the maximum unbiasedly estimated instantaneous loss is (note 
that the exploration probability also scales with the exploration rate $\gamma_t$) 

    $\hat B_t = \frac{B_t}{\gamma_t \min\{w^i : t \ge \tau^i\}}$.    (2)

It is convenient for the analysis to assign an estimated loss of $\hat B_t$ to all currently 
inactive experts. Observe finally that in this way, our master algorithm FoE 
always deals with a finite expert class and is thus computable. 
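
The finitization can be sketched as follows. This is a minimal illustration under stated assumptions, not a definitive implementation: the prior $w^i = 1/(i(i+1))$ and the entering times $\tau^i = i^2$ are hypothetical choices (the latter merely increase with the complexity), the finitized prior is taken to be the prior restricted to the active experts and renormalized, and all function names are ours. Exact fractions are used so that the numbers can be checked by hand.

```python
from fractions import Fraction


def prior_weight(i):
    """Illustrative prior w^i = 1 / (i * (i + 1)); an assumption, not from the paper."""
    return Fraction(1, i * (i + 1))


def entering_time(i):
    """Hypothetical entering time tau^i = i^2, increasing in the complexity."""
    return i * i


def active_experts(t):
    """Experts i with t >= tau^i; always a finite list."""
    active, i = [], 1
    while entering_time(i) <= t:
        active.append(i)
        i += 1
    return active


def finitized_prior(t):
    """Prior restricted to the active experts and renormalized to total mass 1."""
    active = active_experts(t)
    mass = sum(prior_weight(i) for i in active)
    return {i: prior_weight(i) / mass for i in active}


def max_estimated_loss(t, loss_bound, exploration_rate):
    """Upper bound B_t / (gamma_t * min active weight) on the estimated loss."""
    w_min = min(prior_weight(i) for i in active_experts(t))
    return loss_bound / (exploration_rate * w_min)
```

Since renormalizing can only increase the weight of an active expert, the estimate $\ell_t^I/(u_t^I\gamma_t)$ of an explored expert never exceeds the value returned by `max_estimated_loss`.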
Follow the perturbed leader (FPL, specified in Fig. 2) is invoked if FoE does 
not explore. Just following the "leader" (the best expert so far) may not be a 
good strategy. Instead we subtract an exponentially distributed perturbation 
$q_t^i$ from the current score (the complexity penalized past loss) of the experts. An 
important detail of the FPL subroutine is the learning rate $\eta_t > 0$, which should 
be adaptive if the total number of steps $T$ is not known in advance. Please see 
e.g. [3,5] for more details. Also the variant of FPL we use (specified in Fig. 2) 
works on the finitized expert pool. 

Note that each time randomness is used, it is assumed to be independent of 
the past randomness. Performance is evaluated in terms of true or estimated 
cumulative loss; this is specified in the notation. E.g. for the true loss of FPL up 
to and including time $T$ we write $\ell^{FPL}_{1:T}$, while the estimated loss of FoE up to and not 
including time $T$ is $\hat\ell^{FoE}_{<T}$. 
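
One round of the master algorithm, as described in this section, might be sketched in Python as below. This is a hedged illustration rather than the reference implementation: the finitized pool is passed in as a dictionary of prior weights, the explore distribution is taken to be the renormalized prior, `observe_loss` is a hypothetical callback returning the true loss of the played expert only, and the function names are ours.

```python
import math
import random


def fpl_select(est_past_loss, complexity, eta_t, rng):
    """Follow the perturbed leader: argmin of eta_t * loss + k^i - q^i, q^i ~ Exp(1)."""
    best, best_score = None, math.inf
    for i in est_past_loss:
        q = rng.expovariate(1.0)  # independent exponential perturbation
        score = eta_t * est_past_loss[i] + complexity[i] - q
        if score < best_score:
            best, best_score = i, score
    return best


def foe_round(est_cum, prior, gamma_t, eta_t, observe_loss, rng):
    """Play one round; est_cum (expert -> cumulative estimated loss) is updated in place.

    Returns (played expert, whether the round was an exploration round, true loss).
    """
    experts = list(prior)
    if rng.random() < gamma_t:  # explore with probability gamma_t
        mass = sum(prior.values())
        u = {i: prior[i] / mass for i in experts}
        played = rng.choices(experts, weights=[u[i] for i in experts])[0]
        loss = observe_loss(played)
        # Importance weighting keeps the loss estimate unbiased.
        est_cum[played] += loss / (u[played] * gamma_t)
        return played, True, loss
    complexity = {i: -math.log(prior[i]) for i in experts}
    played = fpl_select(est_cum, complexity, eta_t, rng)
    # In a follow round all estimated losses are zero, so est_cum is unchanged.
    return played, False, observe_loss(played)
```

Note that only exploration rounds change the estimates; the follow rounds exploit what exploration has learned, which is why the exploration rate enters both the regret and the size of the estimated losses.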



3 Analysis on the master's time scale 

The following analysis uses McMahan and Blum's trick [4] in order to prove 
bounds against an adaptive adversary. With a different argument, it is possible 
to circumvent Lemma 6, thus achieving better bounds [8]. This will be briefly 
discussed in the last section. 

Let $B_t \ge 0$ be some sequence of upper bounds on the instantaneous losses, 
$\gamma_t \in (0, 1)$ be a sequence of exploration rates, and $\eta_t > 0$ be a decreasing 
sequence of learning rates. The analysis proceeds according to the following 
diagram (where $L$ is an informal abbreviation for the loss and always refers to 
cumulative loss, but sometimes additionally to instantaneous loss). 

    $L^{FoE} \lesssim E L^{FoE} \lesssim E L^{FPL} \lesssim E \hat L^{FPL} \lesssim E \hat L^{IFPL} \lesssim E \hat L^{best} \lesssim L^{best}$    (3)

Each "$\lesssim$" means that we bound the quantity on the left by the quantity on the 
right plus some additive term. The first and the last expressions are the losses of 



    Sample $q_t^i \sim$ Exp independently for all $i \in \{t \ge \tau\}$
    select and play $I_t^{IFPL} = \arg\min_{i: t \ge \tau^i} \{\eta_t \hat\ell_{1:t}^i + k^i - q_t^i\}$



Fig. 3. The algorithm IFPL. The learning rate $\eta_t$ will be specified in Corollary 8.



the FoE algorithm and the best expert, respectively. The intermediate quantities 
belong to different algorithms, namely FoE, FPL, and a third one called IFPL for 
"infeasible" FPL [3]. IFPL, as specified in Fig. 3, is the same as FPL except that 
it has access to an oracle providing the current estimated loss vector $\hat\ell_t$ (hence 
infeasible). Then it assigns scores of $\eta_t \hat\ell_{1:t}^i + k^i - q_t^i$ instead of $\eta_t \hat\ell_{<t}^i + k^i - q_t^i$. 

The randomization of FoE and FPL gives rise to two filtrations of $\sigma$-algebras. 
By $\mathcal{A}_t$ we denote the $\sigma$-algebra generated by FoE's randomness up to time 
$t$, meaning only the random variables $\{u_{1:t}, r_{1:t}\}$. Then $(\mathcal{A}_t)_{t \ge 0}$ is a filtration 
($\mathcal{A}_0$ is the trivial $\sigma$-algebra). We may also write $\mathcal{A} = \bigcup_{t \ge 0} \mathcal{A}_t$. Similarly, $\mathcal{B}_t$ is 
the $\sigma$-algebra generated by FoE's and FPL's randomness up to time $t$ (i.e. 
$\mathcal{B}_t = \sigma\{u_{1:t}, r_{1:t}, q_{1:t}\}$). Then clearly $\mathcal{A}_t \subset \mathcal{B}_t$ for each $t$. 

The reader should think of the expectations in (3) as both ordinary and 
conditional expectations. Conditional expectations are mostly with respect to 
FoE's past randomness $\mathcal{A}_{t-1}$. These conditional expectations of some random 
variable $X$ are abbreviated by 

    $E_t[X] := E[X \mid \mathcal{A}_{t-1}]$. 

Then $E_t[X]$ is an $\mathcal{A}_{t-1}$-measurable random variable, meaning that its value is 
determined for fixed past randomness $\mathcal{A}_{t-1}$. Note in particular that the estimated 
loss vectors $\hat\ell_t$ are random vectors which depend on FoE's randomness $\mathcal{A}_t$ up to 
time $t$. In this way, FoE's (and FPL's and IFPL's) actions depend on FoE's past 
randomness. Note, however, that they do not depend on FPL's randomness $q_{1:t}$. 
We now start with proving the diagram (3). In order to understand the 
analysis, it is important to consider each intermediate algorithm as a stand- 
alone procedure which is actually executed (with an oracle if necessary) on the 
specified inputs (e.g. on the estimated losses) and has the asserted performance 
guarantees (e.g. again on the estimated losses). 

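Lemma 1. [$L^{FoE} \lesssim E L^{FoE}$] For each $T \ge 1$ and $\delta_T \in (0, 1)$, with probability at 
least $1 - \frac{\delta_T}{2}$ we have 

    $\ell^{FoE}_{1:T} \le \sum_{t=1}^T E_t \ell_t^{FoE} + \sqrt{(2 \ln \frac{4}{\delta_T}) \sum_{t=1}^T B_t^2}$. 

Proof. The sequence of random variables $X_T = \sum_{t=1}^T [\ell_t^{FoE} - E_t \ell_t^{FoE}]$ is a 
martingale with respect to the filtration $\mathcal{B}_t$ (not $\mathcal{A}_t$!). In order to see this, observe 
$E[\ell_T^{FoE} \mid \mathcal{B}_{T-1}] = E(E[\ell_T^{FoE} \mid \mathcal{A}_{T-1}] \mid \mathcal{B}_{T-1})$ and $E[\ell_t^{FoE} \mid \mathcal{B}_{T-1}] = \ell_t^{FoE}$ for $t < T$, 
which implies 

    $E[X_T \mid \mathcal{B}_{T-1}] = \sum_{t=1}^T \left( E[\ell_t^{FoE} \mid \mathcal{B}_{T-1}] - E[E[\ell_t^{FoE} \mid \mathcal{A}_{t-1}] \mid \mathcal{B}_{T-1}] \right) = X_{T-1}$. 

Its differences are bounded: $|X_t - X_{t-1}| \le B_t$. Hence, it follows from Azuma's 
inequality (see e.g. [5]) that the probability that $X_T$ exceeds some $\lambda > 0$ is 
bounded by $p = 2 \exp\left(-\frac{\lambda^2}{2 \sum_t B_t^2}\right)$. Requesting $\frac{\delta_T}{2} = p$ and solving for $\lambda$ gives 
the assertion. □ 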
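
The last step of the proof can be written out as the following short derivation. It only rearranges the Azuma bound quoted above; $\lambda$ is the deviation threshold, and the stated failure probability of the lemma is assumed to be $\delta_T/2$.

```latex
\begin{align*}
\frac{\delta_T}{2} = 2\exp\Big(-\frac{\lambda^2}{2\sum_{t=1}^T B_t^2}\Big)
&\iff \frac{\lambda^2}{2\sum_{t=1}^T B_t^2} = \ln\frac{4}{\delta_T} \\
&\iff \lambda = \sqrt{\Big(2\ln\frac{4}{\delta_T}\Big)\sum_{t=1}^T B_t^2}.
\end{align*}
```

For bounded losses $B_t \equiv 1$ this deviation term is of order $\sqrt{T \ln(1/\delta_T)}$.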



Lemma 2. [$E\ell^{FoE} \lesssim E\ell^{FPL}$] $E_t \ell_t^{FoE} \le (1 - \gamma_t) E_t \ell_t^{FPL} + \gamma_t B_t$ holds $\forall t \ge 1$. 

This follows immediately from the specification of FoE. Clearly, a corresponding 
assertion for the ordinary expectations holds by just taking expectations on 
both sides. This is the case for all subsequent lemmas, except for Lemma 6. 

The next lemma relating $E L^{FPL}$ and $E \hat L^{FPL}$ is technical but intuitively clear. 
It states that in expectation, the real loss suffered by FPL is the same as the 
estimated loss. This is simply because the loss estimate is unbiased. A combination 
of this and the previous lemma was shown in [4]. Note that $\hat\ell_t^{FPL}$ is the loss 
$\hat\ell_t^I$ estimated by FoE, but for the expert $I = I_t^{FPL}$ chosen by FPL. 

Lemma 3. [$E L^{FPL} \lesssim E \hat L^{FPL}$] For each $t \ge 1$, we have $E_t \ell_t^{FPL} = E_t \hat\ell_t^{FPL}$. 

Proof. Let $f_t^i = P[I_t^{FPL} = i \mid \mathcal{A}_{t-1}]$ be the probability distribution over actions $i$ 
which FPL uses at time $t$, depending on the past randomness $\mathcal{A}_{t-1}$. Let $u_t$ be 
the finitized prior distribution at time $t$. Then 

    $E_t[\hat\ell_t^{FPL}] = (1 - \gamma_t) \cdot 0 + \gamma_t \sum_{i=1}^\infty f_t^i \left[ (1 - u_t^i) \cdot 0 + u_t^i \, \hat\ell_t^i|_{r_t = 1 \wedge I_t^{FoE} = i} \right] = \sum_{i=1}^\infty f_t^i \ell_t^i = E_t[\ell_t^{FPL}]$, 

where $\hat\ell_t^i|_{r_t = 1 \wedge I_t^{FoE} = i} = \ell_t^i/(u_t^i \gamma_t)$ is the estimated loss under the condition that 
FoE decided to explore ($r_t = 1$) and chose action $I_t^{FoE} = i$. □ 
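
The unbiasedness argument can be checked numerically with exact arithmetic. The sketch below is our own illustration: the distributions `f` (FPL's choice), `u` (the explore distribution), the losses and the exploration rate are made-up numbers, and the function names are hypothetical. It assumes, as in the proof, that an estimate is nonzero only when FoE explores and samples that very expert.

```python
from fractions import Fraction


def expected_estimated_loss(f, u, loss, gamma):
    """Expected estimated loss of the expert FPL would choose.

    For each expert i (chosen by FPL with probability f[i]) the estimate equals
    loss[i] / (u[i] * gamma) with probability gamma * u[i], and 0 otherwise.
    """
    total = Fraction(0)
    for i in f:
        total += f[i] * (gamma * u[i]) * (loss[i] / (u[i] * gamma))
    return total


def expected_true_loss(f, loss):
    """Expected true loss of FPL's choice."""
    return sum(f[i] * loss[i] for i in f)


f = {1: Fraction(1, 4), 2: Fraction(3, 4)}
u = {1: Fraction(2, 3), 2: Fraction(1, 3)}
loss = {1: Fraction(1, 2), 2: Fraction(1)}
gamma = Fraction(1, 10)
```

Both expectations evaluate to $7/8$ here; the factors $u_t^i\gamma_t$ cancel for every expert, which is exactly the cancellation used in the proof.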



The following lemma relates the losses of FPL and IFPL. It is proven in [3] 
and [5]. We give the full proof, since it is the only step in the analysis where we 
have to be careful with the upper loss bound $B_t$. Let $\hat B_t$ be the upper bound on 
the estimated loss (2). (We remark that also for weighted averaging forecasters, 
losses which grow sufficiently slowly do not cause any problem in the analysis. 
In this way, it is straightforward to modify the algorithm by Auer et al. ^U] for 
reactive tasks with a finite expert class.) 

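Lemma 4. [$E \hat L^{FPL} \lesssim E \hat L^{IFPL}$] For all $t \ge 1$, $E_t \hat\ell_t^{FPL} \le E_t \hat\ell_t^{IFPL} + \gamma_t \eta_t \hat B_t^2$ holds. 

Proof. If $r_t = 0$, $\hat\ell_t = 0$ and thus $\hat\ell_t^{FPL} = \hat\ell_t^{IFPL}$ holds. This happens with 
probability $1 - \gamma_t$. Otherwise we have 

    $E_t \hat\ell_t^{FPL} = \sum_i \int \mathbb{1}_{I_t^{FPL} = i} \, \hat\ell_t^i \, d\mu(x)$,    (4)

where $\mu$ denotes the (exponential) distribution of the perturbations, i.e. $x_i := q_t^i$ 
and density $\mu(x) := e^{-\|x\|_1}$. The idea is now that if action $i$ was selected by 
FPL, it is - because of the exponentially distributed perturbation - with high 
probability also selected by IFPL. Formally, we write $u^+ = \max(u, 0)$ for $u \in \mathbb{R}$, 
abbreviate $\lambda = \hat\ell_{<t} + k/\eta_t$, and denote by $\int \ldots d\mu(x_{\ne i})$ the integration leaving 
out the $i$th action. Then, using $\eta_t \lambda_i - x_i \le \eta_t \lambda_j - x_j$ $\forall j$ if $I_t^{FPL} = i$ in the first 
equation, and $\hat B_t \ge \hat\ell_t^i - \hat\ell_t^j$ in the last line, we get 

    $\int \mathbb{1}_{I_t^{FPL} = i} \, \hat\ell_t^i \, d\mu(x) = \int \hat\ell_t^i \int_{x_i \ge \max_{j \ne i}\{\eta_t(\lambda_i - \lambda_j) + x_j\}} d\mu(x_i) \, d\mu(x_{\ne i}) = \int \hat\ell_t^i \, e^{-(\max_{j \ne i}\{\eta_t(\lambda_i - \lambda_j) + x_j\})^+} d\mu(x_{\ne i})$

    $\le e^{\eta_t \hat B_t} \int \hat\ell_t^i \, e^{-(\max_{j \ne i}\{\eta_t(\lambda_i - \lambda_j) + x_j\} + \eta_t \hat B_t)^+} d\mu(x_{\ne i})$

    $\le e^{\eta_t \hat B_t} \int \hat\ell_t^i \, e^{-(\max_{j \ne i}\{\eta_t(\lambda_i + \hat\ell_t^i - \lambda_j - \hat\ell_t^j) + x_j\})^+} d\mu(x_{\ne i}) = e^{\eta_t \hat B_t} \int \mathbb{1}_{I_t^{IFPL} = i} \, \hat\ell_t^i \, d\mu(x)$. 

Summing over $i$ and using the analog of (4) for IFPL, we see that if $r_t = 1$, then 
$E_t \hat\ell_t^{FPL} \le e^{\eta_t \hat B_t} E_t \hat\ell_t^{IFPL}$ holds. Thus $E_t \hat\ell_t^{IFPL} \ge e^{-\eta_t \hat B_t} E_t \hat\ell_t^{FPL} \ge (1 - \eta_t \hat B_t) E_t \hat\ell_t^{FPL} \ge 
E_t \hat\ell_t^{FPL} - \eta_t \hat B_t^2$. The assertion now follows by taking expectations w.r.t. $r_t$. □ 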



The next lemma relates the losses of IFPL and the best action in hindsight. 
For an oblivious adversary (which means that the adversary's decisions do not 
depend on our past actions), the proof is quite simple [3]. An additional step is 
necessary for an adaptive adversary 

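Lemma 5. [$E \hat L^{IFPL} \lesssim E \hat L^{best}$] Assume that $\sum_i e^{-k^i} \le 1$ and $\tau^i$ depends 
monotonically on $k^i$, i.e. $\tau^i \ge \tau^j$ if and only if $k^i \ge k^j$. Assume decreasing learning 
rate $\eta_t$. For all $T \ge 1$ and all $i \ge 1$, 

    $\sum_{t=1}^T E_t \hat\ell_t^{IFPL} \le \hat\ell^i_{1:T} + \frac{k^i + 1}{\eta_T}$. 

Proof. This is a modification of the corresponding proofs in [3] and [5]. We may 
fix the randomization $\mathcal{A}$ and suppress it in the notation. Then we only need to 
show 

    $E \hat\ell^{IFPL}_{1:T} \le \min_i \left\{ \hat\ell^i_{1:T} + \frac{k^i + 1}{\eta_T} \right\}$,    (5)

where the expectation is with respect to IFPL's randomness $q_{1:T}$. 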



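Assume first that the adversary is oblivious. We define an algorithm $A$ as a 
variant of IFPL which samples only one perturbation vector $q$ in the beginning 
and uses this in each time step, i.e. $q_t \equiv q$. Since the adversary is oblivious, $A$ 
is equivalent to IFPL in terms of expected performance. This is all we need to 
show (5). Let $\eta_0 = \infty$ and $\lambda_t = \hat\ell_t + (k - q)\left(\frac{1}{\eta_t} - \frac{1}{\eta_{t-1}}\right)$, then $\lambda_{1:t} = \hat\ell_{1:t} + \frac{k - q}{\eta_t}$. 
Recall $\{t \ge \tau\} = \{i : t \ge \tau^i\}$. We argue by induction that for all $T \ge 1$, 

    $\sum_{t=1}^T \lambda_t^{I_t^A} \le \min_{T \ge \tau} \lambda^i_{1:T} + \max_{T \ge \tau} \left\{ \frac{q^i - k^i}{\eta_T} \right\}$.    (6)

This clearly holds for $T = 0$. For the induction step, we have to show 

    $\min_{T \ge \tau} \lambda^i_{1:T} + \max_{T \ge \tau} \left\{ \frac{q^i - k^i}{\eta_T} \right\} + \lambda_{T+1}^{I_{T+1}^A} \le \lambda_{1:T}^{I_{T+1}^A} + \max_{T+1 \ge \tau} \left\{ \frac{q^i - k^i}{\eta_{T+1}} \right\} + \lambda_{T+1}^{I_{T+1}^A}$    (7)

    $= \min_{T+1 \ge \tau} \lambda^i_{1:T+1} + \max_{T+1 \ge \tau} \left\{ \frac{q^i - k^i}{\eta_{T+1}} \right\}$. 

The inequality is obvious if $I_{T+1}^A \in \{T \ge \tau\}$. Otherwise, let $J = \arg\max \{q^i - k^i : 
i \in \{T \ge \tau\}\}$. Then 

    $\min_{T \ge \tau} \lambda^i_{1:T} + \max_{T \ge \tau} \left\{ \frac{q^i - k^i}{\eta_T} \right\} \le \lambda^J_{1:T} + \frac{q^J - k^J}{\eta_T} = \hat\ell^J_{1:T} \le \sum_{t=1}^T \hat B_t$

    $= \hat\ell^{I_{T+1}^A}_{1:T} \le \lambda^{I_{T+1}^A}_{1:T} + \max_{T+1 \ge \tau} \left\{ \frac{q^i - k^i}{\eta_{T+1}} \right\}$

shows (7). Rearranging terms in (6), we see 

    $\sum_{t=1}^T \hat\ell_t^{I_t^A} \le \min_{T \ge \tau} \lambda^i_{1:T} + \max_{T \ge \tau} \left\{ \frac{q^i - k^i}{\eta_T} \right\} + \sum_{t=1}^T (q - k)^{I_t^A} \left( \frac{1}{\eta_t} - \frac{1}{\eta_{t-1}} \right)$. 

The assertion (5) - still for oblivious adversary and $q_t \equiv q$ - then follows by 
taking expectations and using 

    $E \min_{T \ge \tau} \lambda^i_{1:T} \le \min_{T \ge \tau} \left\{ \hat\ell^i_{1:T} + \frac{k^i - E q^i}{\eta_T} \right\} \le \min_i \left\{ \hat\ell^i_{1:T} + \frac{k^i - 1}{\eta_T} \right\}$ and    (8)

    $E \sum_{t=1}^T (q - k)^{I_t^A} \left( \frac{1}{\eta_t} - \frac{1}{\eta_{t-1}} \right) \le E \max_{T \ge \tau} \left\{ \frac{q^i - k^i}{\eta_T} \right\} \le \frac{1}{\eta_T}$.    (9)

The second inequality of (8) holds because $\tau^i$ depends monotonically on $k^i$, and 
$E q^i = 1$, and maximality of $\hat\ell^i_{1:T}$ for $T < \tau^i$. The second inequality of (9) can be 
proven by a simple application of the union bound, see e.g. [5, Lem. 1]. 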

Sampling the perturbations $q_t$ independently is equivalent under expectation 
to sampling $q$ only once. So assume that $q_t$ are sampled independently, i.e. that 
IFPL is played against an oblivious adversary: (5) remains valid. In the last step, 
we argue that then (5) also holds for an adaptive adversary. This is true because 
the future actions of IFPL do not depend on its past actions, and therefore the 
adversary cannot gain from deciding after having seen IFPL's decisions. This 
argument can be made formal, as shown in [11, Lemma 12]. (Note the subtlety 
that the future actions of FoE would depend on its past actions.) □ 



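Finally, we give a relation between the estimated and true losses (adapted 
from [4]). 

Lemma 6. [$E \hat L^{best} \lesssim L^{best}$] For each $T \ge 1$, $\delta_T \in (0, 1)$, and $i \ge 1$, we have 

(i) $\hat\ell^i_{1:T} \le \ell^i_{1:T} + \sqrt{(2 \ln \frac{4}{\delta_T}) \sum_{t=1}^T \hat B_t^2} + \sum_{t=1}^{\tau^i - 1} \hat B_t$ w.p. $1 - \frac{\delta_T}{2}$, and hence 
(ii) $E \hat\ell^i_{1:T} \le \ell^i_{1:T} + \sqrt{(2 \ln \frac{4}{\delta_T}) \sum_{t=1}^T \hat B_t^2} + \frac{\delta_T}{2} \sum_{t=1}^T \hat B_t + \sum_{t=1}^{\tau^i - 1} \hat B_t$. 

Proof. For $t \ge \tau^i$, $X_t = \hat\ell^i_{1:t} - \ell^i_{1:t}$ is a martingale, since 

    $E[X_t \mid \mathcal{A}_{t-1}] = E[\hat\ell^i_{1:t} \mid \mathcal{A}_{t-1}] - \ell^i_{1:t} = X_{t-1} + E[\hat\ell_t^i \mid \mathcal{A}_{t-1}] - \ell_t^i = X_{t-1}$. 

It is clear that $X_{\tau^i - 1} \le \sum_{t=1}^{\tau^i - 1} \hat B_t$. Moreover, $|X_t - X_{t-1}| \le \hat B_t$ for $t \ge \tau^i$, i.e. we 
have bounded differences. By Azuma's inequality, the actual value $X_T - X_{\tau^i - 1}$ 
does not exceed $\sqrt{(2 \ln \frac{4}{\delta_T}) \sum_{t=1}^T \hat B_t^2}$ with probability $1 - \frac{\delta_T}{2}$. This proves (i). To 
arrive at (ii), take expectations and observe that (i) fails with probability at 
most $\frac{\delta_T}{2}$, in which case $\hat\ell^i_{1:T} \le \sum_{t=1}^T \hat B_t$ holds. □ 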



We now combine the above results and derive an upper bound on the expected 
regret of FoE against an adaptive adversary. 

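Theorem 7. [FoE against an adaptive adversary] Let $\sum_i e^{-k^i} \le 1$, $\tau^i$ depend 
monotonically on $k^i$, and the learning rate $\eta_t$ be decreasing. Let $\ell_t$ be some 
possibly adaptive assignment of (true) loss vectors satisfying $\|\ell_t\|_\infty \le B_t$. Then for 
all experts $i$, we have with probability at least $1 - \delta_T$ 

    $\ell^{FoE}_{1:T} \le \ell^i_{1:T} + \frac{k^i + 1}{\eta_T} + \sum_{t=1}^{\tau^i - 1} \hat B_t + \sum_{t=1}^T \gamma_t \eta_t \hat B_t^2 + \sum_{t=1}^T \gamma_t B_t + \sqrt{(2 \ln \frac{4}{\delta_T}) \sum_{t=1}^T B_t^2} + \sqrt{(2 \ln \frac{4}{\delta_T}) \sum_{t=1}^T \hat B_t^2}$. 

Consequently, in expectation, we have 

    $E \ell^{FoE}_{1:T} \le \ell^i_{1:T} + \frac{k^i + 1}{\eta_T} + \sum_{t=1}^{\tau^i - 1} \hat B_t + \sum_{t=1}^T \gamma_t \eta_t \hat B_t^2 + \sum_{t=1}^T \gamma_t B_t + \sqrt{(2 \ln \frac{4}{\delta_T}) \sum_{t=1}^T \hat B_t^2} + \frac{\delta_T}{2} \sum_{t=1}^T \hat B_t$. 

Proof. This follows by summing up all excess terms in the above lemmas. Recall 
that we only need to take expectations on both sides of the assertions of Lemmas 
2-5 in order to obtain the second bound on the expectation (and we don't need 
Lemma 1 there). □ 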



Corollary 8. Assume the conditions of Theorem^ and choose rjt = t and 
7f — . Then 

(i) Bt = l,T'=\{w')-^] ^Eef^ <e\.j. + 0{{-;^y^ +k'TiVh[T), 
(ti) Bt = l,T'=rK)-«l <i\.,T + 0{{^y'+k'TiVh^)w.p.l-T-^ 

(Hi) Bt^tTe^T'^\{w')-^^] ^E£f°|<4^y + 0((i)22 + fc'TiVbrT), and 

for all $i$ and $T \ge 1$ (recall $k^i = -\ln w^i$). Moreover, in both cases (bounded and 
growing $B_t$) FoE is asymptotically optimal w.r.t. each expert, i.e. for all $i$, 

    $\limsup_{T \to \infty} \frac{\ell^{FoE}_{1:T} - \ell^i_{1:T}}{T} \le 0$ almost surely. 

The asymptotic optimality is sometimes termed Hannan-consistency, in particular 
if the limit equals zero. We only show the upper bound. 



Proof. Assertions {i)-{iv) follow from the previous theorem: Set St — T^^, ab- 
breviate w™'" = min{w* : t > r*}, and observe that for r* = and 
Bt = t^, we have 



m;™ = min{?i;' : T > {{w')-"]} > mm{w' : T"" < w']} > T 

t'-1 

< 

t=l 



and 



(note m;™^ > (r* - > (w*)^-")^-*)). Then (i) and (m) follow from a = 8, 
(3 = 0, and {Hi) and (iw) follow from a = 16, /3 = rj^. The asymptotic optimality 
finally follows from the Borel-Cantelli Lemma, since according to (ii) and {iv), 



//FoE mim 



< 



2^2 



for an appropriate C > 0. 



□ 



As mentioned in the first paragraph of this section, it is possible to avoid 
Lemma El thus arriving at better bounds. E.g. in (z), choosing r* = [(:^)'^], 
7t = t~i, and rjt = t^*, a regret bound of -I- k^T^) can be shown. 

Of course, also a corresponding high probability bound like (m) holds. Likewise, 
for a similar statement as {Hi), we may set = , Bt = t» , — , 

and rjt = t~i , arriving at a regret bound of -I- k^Ti) Generally, in this 

way any regret bound $O((1/w^i)^c + k^i T^{2/3+\varepsilon})$ is possible, at the cost of increasing $c$ 
where $\varepsilon \to 0$. 



    set $\tilde t = 1$

    For $t = 1, 2, 3, \ldots$

        invoke FoE($t$) and play its decision for $B_t$ basic time steps
        set $\tilde t = \tilde t + B_t$



Fig. 4. The algorithm $\widetilde{\mathrm{FoE}}$, where $B_t$ is specified in Corollary 9.



4 Reactive environments and a universal master 
algorithm 

Regret can become a quite subtle notion if we start considering reactive 
environments, i.e. care for future consequences of a decision. An extreme case is the 
"heaven-hell" example: We have two experts, one always playing 0 ("saying a 
prayer"), the other one always playing 1 ("cursing"). If we always follow the first 
expert, we stay in heaven and get no loss in each step. As soon as we "curse" only 
once, we get into hell and receive maximum loss in all subsequent time steps. 
Clearly, any algorithm without prior knowledge must "fail" in this situation. 

One way to get around this problem is taking into account the actual (real- 
ization of the) game we are playing. For instance, after "cursing" once, also the 
praying expert goes to hell together with us and subsequently has maximum loss. 
Hence, we are interested in a regret defined as $E\ell^{FoE}_{1:T} - \ell^i_{1:T}$ as in the previous 
section. So what is missing? This becomes clear in the following example. 

Consider the repeated "prisoner's dilemma" against the tit-for-tat^ strategy 
p. If we use two strategies as experts, namely "always cooperate" and "always 
defect", then it is clear that always cooperating will have the best long-term 
reward. However, a standard expert advice or bandit master algorithm will not 
discover this, since it compares only the losses in one step, which are always lower 
for the defecting expert. To put it differently, minimizing short-term regret is 
not at all a good idea here. E.g. always defecting has no regret, while for always 
cooperating the regret grows linearly. But this is only the case for short-term 
regret, i.e. if we restrict to time intervals of length one. 
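
The gap between short-term and long-term regret in this example can be reproduced with a few lines of Python. The sketch is ours and the loss values are hypothetical: we only assume an ordering "no loss < small < large < very large" of the four outcomes, encoded as 0 < 1 < 3 < 5.

```python
# Hypothetical losses for (my move, opponent's move); only their ordering matters.
LOSS = {("C", "C"): 1, ("D", "D"): 3, ("C", "D"): 5, ("D", "C"): 0}


def play_against_tit_for_tat(my_moves):
    """Cumulative loss of a fixed move sequence against tit-for-tat."""
    opponent = "C"  # tit-for-tat cooperates in the first round
    total = 0
    for mine in my_moves:
        total += LOSS[(mine, opponent)]
        opponent = mine  # afterwards it repeats my preceding move
    return total


T = 100
always_cooperate = play_against_tit_for_tat("C" * T)
always_defect = play_against_tit_for_tat("D" * T)

# In any single round, defecting is better whatever the opponent plays.
defect_dominates = all(LOSS[("D", o)] < LOSS[("C", o)] for o in "CD")
```

With these numbers always cooperating costs 100 over 100 rounds while always defecting costs 297, although defecting wins every one-step comparison. A master that only compares single-step losses therefore prefers the wrong expert.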

We therefore give the control to a selected expert for periods of increasing 
length. Precisely, we introduce a new time scale $\tilde t$ (the basic time scale) at which 
we have single games with losses $\tilde\ell_{\tilde t}$. The master's time scale $t$ does not coincide 
with $\tilde t$. Instead, at each $t$, the master gives control to the selected expert $i$ 
for $B_t \ge 1$ single games and receives loss $\ell_t^i = \sum_{\tilde t = \tilde t(t)}^{\tilde t(t) + B_t - 1} \tilde\ell^i_{\tilde t}$. (The points $\tilde t(t)$ in 
basic time are defined recursively, see Fig. 4.) Assume that the game has bounded 
instantaneous losses $\tilde\ell^i_{\tilde t} \in [0, 1]$. Then the master algorithm's instantaneous losses 
^ In the prisoner's dilemma, two players both decide independently whether they are 
cooperating (C) or defecting (D). If both play C, they both get a small loss; if both play 
D, they get a large loss. However, if one plays C and one D, the cooperating player 
gets a very large loss and the defecting player no loss at all. Thus defecting is a dom- 
inant strategy. Tit-for-tat plays C in the first move and afterwards the opponent's 
respective preceding move. 



are bounded by $B_t$. We denote the algorithm, which is completely specified 
in Fig. 4, by $\widetilde{\mathrm{FoE}}$. Then the following assertion is an easy consequence of the 
previous results. 
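
The recursive definition of the points in basic time can be sketched as follows. The block length used in the example, the integer square root of $t$, is a purely illustrative stand-in for $B_t$ and not the choice made in the paper; the function name is ours.

```python
import math


def basic_time_blocks(block_length, rounds):
    """Yield (t, first, last): master round t controls basic steps first..last."""
    t_basic = 1
    for t in range(1, rounds + 1):
        b = block_length(t)  # number of single games played in master round t
        yield t, t_basic, t_basic + b - 1
        t_basic += b


# Hypothetical slowly growing block length B_t = floor(sqrt(t)) >= 1.
blocks = list(basic_time_blocks(lambda t: math.isqrt(t), 5))
```

Because the selected expert keeps control for a whole block, its loss in a master round is a sum of up to $B_t$ single-game losses, which is why the master must cope with loss bounds that grow in time.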

Corollary 9. Assume FoE plays a repeated game with bounded instantaneous 
losses i\ e [0,1]. Choose -ft = t-i, rjt = t~i , Bt = [^^J and = [(w*)"!^]. 
Then for all experts i and all T > 1, 

i^? < il^^ + 0{{^f^ + k'f^) w.p. 1 - f-^ and 

mf? < \^ + o((i)22 + ]ef^. 

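Consequently, $\limsup_{\tilde T \to \infty} (\tilde\ell^{\widetilde{FoE}}_{1:\tilde T} - \tilde\ell^i_{1:\tilde T})/\tilde T \le 0$ a.s. The rate of convergence is at 
least $\tilde T^{-1/10}$, and it can be improved to $\tilde T^{-1/3+\varepsilon}$ at the cost of a larger power of $\frac{1}{w^i}$. 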

Proof. This follows from changing the time scale from t to t in Coroll ary |5 | t 
is of order i^+w. Consequently, the regret bound is 0((^)^^ + /c'Tt? \/lnT) < 



Broadly speaking, this means that $\widetilde{\mathrm{FoE}}$ performs asymptotically as well as 
the best expert. Asymptotic performance guarantees for the Strategic Experts 
Algorithm have been derived in [1]. Our results improve upon this by providing 
a rate of convergence. One can give further corollaries, e.g. in terms of flexibility 
as defined in [1]. 

Since we can handle countably infinite expert classes, we may specify a uni- 
versal experts algorithm. To this aim, let expert i be derived from the ith (valid) 
program p^ of some fixed universal Turing machine. The ith program can be well- 
defined, e.g. by representing programs as binary strings and lexicographically 
ordering them "B*. Before the expert is consulted, the relevant input is written 
to the input tape of the corresponding program. If the program halts, an appro- 
priate part of the output is interpreted as the expert's recommendation. E.g. if 
the decision is binary, then the first bit suffices. (If the program does not halt, 
we may for well-definedness just fill its output tape with zeros.) Each expert is 
assigned a prior weight $w^i = 2^{-\mathrm{length}(p^i)}$, where length($p^i$) is the length of 
the corresponding program and we assume the program tape to be binary. This 
construction parallels the definition of Solomonoff's universal prior [12]. 

Corollary 10. If $\widetilde{\mathrm{FoE}}$ is used together with a universal expert class as 
specified above and the parameters $\eta_t, \gamma_t, B_t, \delta_T$ are chosen as in Corollary 9, then 
it performs asymptotically at least as well as any computable expert $i$. The 
upper bound on the rate of convergence is exponential in the complexity $k^i$ and 
proportional to $\tilde t^{-1/10}$ (improvable to $\tilde t^{-1/3+\varepsilon}$). 

The universal prior has been used to define a universal agent AIXI in a quite 
different way Note that like the universal prior and the AIXI agent, our 

universal experts algorithm is not computable, since we cannot check whether the 
computation of an expert halts. On the other hand, if used with computable 



experts, the algorithm is computationally feasible (at each time $t$ we need to 
consider only finitely many experts). Moreover, it is easy to impose an addi- 
tional constraint on the computation time of each expert and abort the expert's 
computation after $C_t$ operations on the Turing machine. We may choose some 
(possibly rapidly) growing function $C_t$, e.g. $C_t = 2^t$. The resulting master 
algorithm is fully computable and has small regret with respect to all resource 
bounded strategies. 
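
The time-bounded construction can be illustrated with a small Python sketch. It is only a toy model under our own assumptions: an expert program is represented as a generator that yields once per computation step and returns its recommendation, the budget counts calls to `next`, and the default recommendation 0 mirrors the convention of filling the output with zeros. All names are hypothetical.

```python
def run_with_budget(program, budget, default=0):
    """Advance a generator-based program at most `budget` times.

    Returns the program's return value if it finishes within the budget,
    and `default` if the computation is aborted.
    """
    steps = program()
    for _ in range(budget):
        try:
            next(steps)
        except StopIteration as done:
            return done.value
    return default


def halts_after(n, answer):
    """Build a program that performs n computation steps and then returns answer."""
    def program():
        for _ in range(n):
            yield
        return answer
    return program


def never_halts():
    """A program that loops forever."""
    while True:
        yield


def budget_at(t):
    """Example budget C_t = 2^t."""
    return 2 ** t
```

With a growing budget, every halting program is eventually run to completion, so each resource-bounded strategy is represented faithfully from some time on, while non-halting programs can never block the master.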

It is important to keep in mind that Corollaries 9 and 10 give assertions 
relative to the experts' performance merely on the actual action-observation 
sequence. In other words, if we wish to assess how well $\widetilde{\mathrm{FoE}}$ does, we have to 
evaluate the actual value of the best expert [1]. Note that the whole point of our 
increasing time construction is to cause this actual value to coincide with the 
value under ideal conditions. For passive tasks, this coincidence always holds 
with any experts algorithm. With FoE, the actual and the ideal value of an 
expert coincide in many further situations, such as "finitely controllable tasks" . 
By this we mean cases where the best expert can drive the environment into some 
optimal state in a fixed finite number of time steps. An instance is the prisoner's 
dilemma with tit-for-tat being the opponent. The following is an example for a 
formalization of this statement. 

Proposition 11. Suppose FoE acts in a (fully or partially observable) Markov 
Decision Process. Let there be a computable strategy which is able to reach an 
ideal (that is optimal w.r.t. reward) state sequence in a fixed number of time 
steps. Then FoE performs asymptotically optimal. 

This statement may be generalized to cases where only a close to optimal 
state sequence is reached with high probability. However, we need assumptions 
on the closeness to optimality for a given target probability, which are compatible 
with the sampling behavior of FoE. 

Not all environments have this or similar nice properties. As mentioned above, 
any version of FoE would not perform well in the "heaven-hell" example. The 
following is a slightly more interesting variant of the heaven-hell task, where 
we might wish to learn optimal behavior, however FoE will not. Consider the 
heaven-hell example from the beginning of this section, but assume that if at 
time t I am in hell and I "pray" for t consecutive time steps, I will get back into 
heaven. Then it is not hard to see that FoE's exploration is so dominant that 
almost surely, FoE will eventually stay in hell. 

Simulations with some 2x2 matrix games show similar effects, depending on 
the opponent. We briefly discuss the repeated game of "chicken"'*. In this game, 

* This game, also known as "Hawk and Dove", can be interpreted as follows. Two 
coauthors write a paper, but each tries to spend as little effort as possible. If one 
succeeds to let the other do the whole work, he has a high reward. On the other hand, 
if no one does anything, there will be no paper and thus no reward. Finally, if both 
decide to cooperate, both get some reward. We choose the loss matrix as (^^g the 
learner is the column player, the opponent's loss matrix is the transpose, choosing 
the first column means to defect, the second to cooperate. Hence, in the repeated 
game, it is socially optimal to take turns cooperating and defecting. 



it is desirable for the learner to become the "dominant defector" , i.e. to defect in 
the majority of the cases while the opponent cooperates. Let's call an opponent 
"primitive" if he agrees to cooperate after a fixed number of consecutive defecting 
moves of FoE, and let's call him "stubborn" if this number is high. Then FoE 
learns to be the dominant defector against any primitive opponent, however 
stubborn. On the other hand, if the opponent is some learning strategy which also 
tries to be the dominant defector and learns faster (we conducted the experiment 
with AIXI [Hj), then FoE settles for cooperating, and the opponent will be the 
dominant defector. Interestingly however, AIXI would not learn to defect against 
a stubborn primitive opponent. Under this point of view, it seems questionable 
that there is something like a universally optimal balance of exploration vs. 
exploitation in active learning at all. 

5 Discussion 

An alternative argument for adaptive adversary. As mentioned in the 
beginning of Section 3, the analysis we gave uses a trick from [4]. Such a trick 
seems necessary, as the basic FPL analysis only works for an oblivious adversary. 
The simple argument from [11] which we used in the last paragraph of the proof 
of Lemma 5 works only for full observation games (note that considering the 
estimated losses, we were actually dealing with full observations there). In order 
to obtain a similar result in the partial observation case, we may argue as 
follows. We let the game proceed for $T$ time steps with independent randomization 
against an adaptive adversary. Then we analyze FoE's performance in 
retrospect. In particular, we note that for the losses assigned by the adversary, 
FoE's expected regret coincides with the regret of another, virtual algorithm, 
which uses (in its FPL subroutine) identical perturbations $q_t \equiv q$. Performing 
the analysis for this virtual algorithm, we arrive at the desired assertion, however 
without needing Lemma 6. This results in tighter bounds as stated above. The 
argument is formally elaborated in [8]. 

Actual learning speed and lower bounds. In practice, the bounds we have 
proven seem irrelevant except for small expert classes, although they assert almost 
sure optimality and even a convergence rate. The exponential of the complexity 
$1/w^i$ may be huge. Imagine for instance a moderately complex task and some good 
strategy, which can be coded with a mere 500 bits. Then its prior weight is $2^{-500}$, a 
constant which is not distinguishable from zero in all practical situations. Thus, 
it seems that the bounds can be relevant at most for small expert classes with 
uniform prior. This is a general shortcoming of bandit style experts algorithms: 
For uniform prior a lower bound on the expected loss which scales with $\sqrt{n}$ 
(where $n$ is the size of the expert class) has been proven [10]. 

In order to get a lower bound on FoE's regret in the time $T$, observe that 
FoE is a label-efficient learner [15,16]: According to the definition in [16], we 
may assume that in each exploration step, we incur maximal loss $B_t$. It is 
immediate that the same analysis then still holds. For label-efficient prediction, 
Cesa-Bianchi et al. [16] have shown a lower regret bound of $O(T^{2/3})$. Since 
according to the remark at the end of Section 3 we have an upper bound of 
$O((1/w^i)^c + k^i T^{2/3+\varepsilon})$, this is almost tight except for the additive $(1/w^i)^c$ term. It is 
an open problem to state a lower bound simultaneously tight in both $1/w^i$ and $T$. 

Even if the bounds, in particular $1/w^i$, seem not practical, maybe FoE would 
learn sufficiently quickly in practice anyway? We believe that this is not so in 
most cases: The design of FoE is too much tailored towards worst-case environ- 
ments, FoE is too defensive. Assume that we have a "good" and a "bad" expert, 
and FoE learns this fact after some time. Then it still would spend a relatively 
huge fraction of jt = to exploring the bad expert. Such defensive behavior 
seems only acceptable if we are already starting with a class of good experts. 